Objectives

The dataset comprises the raw RNA-seq counts from 5 breast cancer patients and 5 healthy individuals.
Aim: Perform a differential expression analysis to identify the top differentially expressed genes

Normalization for sequencing depth differences

Import Data

files <- read.table("input/TCGA_raw_counts.txt", sep = ",", header = T, row.names = 1)
datacount <- round(as.matrix(files))
groups <- c( rep("normal", 5), rep("tumour", 5))
Data <- data.frame(condition = as.factor(groups))
rownames(Data) <- colnames(datacount)

Generate DESeqDataSet and transformate read counts

DESeq.ds <-DESeqDataSetFromMatrix(countData = datacount, colData = Data, design= ~condition)
# remove genes with counts less than 10
DESeq.ds <- DESeq.ds[ rowSums(counts(DESeq.ds)) > 10, ]

To normalize read counts to account for differences in sequencing depths

# calculate the size factor and add it to the data set
DESeq.ds <- estimateSizeFactors(DESeq.ds)
sizeFactors(DESeq.ds)
##  Normal.1  Normal.2  Normal.3  Normal.4  Normal.5  Tumour.1  Tumour.2  Tumour.3 
## 0.8483932 0.7020557 0.8970659 1.3862488 0.9090357 0.8908757 1.0861023 0.9136926 
##  Tumour.4  Tumour.5 
## 1.4140354 1.0784257
counts.sf_normalized <- counts(DESeq.ds, normalized = T)

Transformation of sequencing-depth-normalized read counts

Due to the relatively large dynamic range of expression values that RNA-seq data can cover, many downstream analyses (including clustering) work much better if the read counts are transformed to the log scale following normalization. log2 is commonly used. The transformation should be done in addition to sequencing depth normalization.

Log2 transformation of read counts

# transform size-factor normalized read counts to log2 scale using a pseudocount of 1
log.norm.counts <- log2(counts.sf_normalized + 1)
par(mfrow=c(2,1)) # to plot the following two images underneath each other
# first, boxplots of non-transformed read counts (one per sample)
boxplot(counts.sf_normalized, notch = TRUE,
        main = "untransformed read counts", ylab = "read counts")
# box plots of log2-transformed read counts
boxplot(log.norm.counts, notch = TRUE,
        main = "log2-transformed read counts",
        ylab = "log2(read counts)")

# Comparison of the read distribution plots for untransformed and log2-transformed values.

Visually exploring normalized read counts

To get an impression of how similar read counts are between replicates. The counts are plotted in a pairwise manner. Example plots are showed below. It is clearly counts are more similar within replicates. Besides, for counts less than 128(2^7 =128), the variance is higher than for those with greater read counts.

# Visually exploring normalized read counts
par(mfrow=c(1,2)) # to plot the following two images underneath each other
plot(log.norm.counts[,c(1,2)], cex=.1, main = "Normal.1 vs Normal.2\nNormalized log2(read counts)")
plot(log.norm.counts[,c(1,6)], cex=.1, main = "Normal.1 vs Tumour.1 \nNormalized log2(read counts)")

Many statistical tests and analyses assume that data is homoskedastic, i.e. that all variables have similar variance. However, data with large differences among the sizes of the individual observations often shows heteroskedastic behavior. One way to visually check for heteroskedasticity is to plot the mean vs. the standard deviation.

# mean-sd plot
msd_plot <- meanSdPlot(log.norm.counts ,
                       ranks=FALSE, # show the data on the original scale 
                       plot = FALSE)
msd_plot$gg +
  ggtitle("sequencing depth normalized log2(read counts)") + ylab("standard deviation")

From the figure above, the same conclusion can be drawn that for counts less than 128(2^7 =128), the variance is higher than for those with greater read counts.

To reduce the amount of heteroskedasticity, DESeq2’s rlog() function is used. The rlog() function returns values that are both normalized for sequencing depth and transformed to the log2 scale where the values are adjusted to fit the experiment-wide trend of the variance-mean relationship.

Transformation of read counts including variance shrinkage

# obtain regularized log-transformed values
DESeq.rlog <- rlog(DESeq.ds, blind = TRUE)
rlog.norm.counts <- assay(DESeq.rlog)
par(mfrow=c(1,2))
plot(rlog.norm.counts[,c(1,2)], cex=.1, main = "Normal.1 vs Normal.2 rlog \ntransformed")
plot(rlog.norm.counts[,c(1,6)], cex=.1, main = "Normal.1 vs Tumour.1 rlog \ntransformed")

# mean-sd plot for rlog-transformed data
msd_plot <- meanSdPlot(rlog.norm.counts ,
                       ranks=FALSE, # show the data on the original scale 
                       plot = FALSE)
msd_plot$gg +
  ggtitle("rlog-transformed read counts") + ylab("standard deviation")

After transformation using DESeq2’s rlog(), the amount of heteroskedasticity decreases.

Exploring global read count patterns

Hierarchical clustering

par(mfrow=c(1,1))
# cor() calculates the correlation between columns of a matrix
distance.m_rlog <- as.dist(1 - cor(rlog.norm.counts, method = "pearson" ))
# plot() can directly interpret the output of hclust()
p_h_all <- ggdendrogram( hclust(distance.m_rlog)) + 
  labs(title = "rlog transformed read counts \n distance: Pearson correlation")
p_h_all

To determine whether the different sample types can be separated in an unsupervised fashion,hierarchical clustering were used. Normal.4 is grouped incorrectly in tumour cluster. The sample should be deleted in subsequent analysis.

PCA

Principal Components Analysis (PCA) is a complementary approach to determine whether samples display greater variability between experimental conditions than between replicates of the same treatment.

P_p_all_data <- plotPCA(DESeq.rlog, returnData =T)
P_p_all <- ggplot(P_p_all_data, aes(x=PC1, y=PC2, fill = group, color = group)) + geom_point(size=3,shape=21) + theme_bw() + ggtitle("Rlog transformed counts") + geom_text(aes(y = PC2+5, x = PC1 +5, label=name))
P_p_all

Sample Normal.4 is too close to tumour samples in the the first principal component (PC1). Combining the results of Hierarchical clustering, it will be excluded from the group in the following analysis.

Exclusion of sample Normal.4

datacount <- round(as.matrix(files))
datacount.new <- datacount[, -4]
groups.new <- c( rep("normal", 4), rep("tumour", 5))
Data.new <- data.frame(condition = as.factor(groups.new))
rownames(Data.new) <- colnames(datacount.new)
DESeq.ds.new <-DESeqDataSetFromMatrix(countData = datacount.new, colData = Data.new, design= ~condition)
# remove genes without any counts
DESeq.ds.new <- DESeq.ds.new[ rowSums(counts(DESeq.ds.new)) > 10, ]
# calculate the size factor and add it to the data set
DESeq.ds.new <- estimateSizeFactors(DESeq.ds.new)
sizeFactors(DESeq.ds.new)
##  Normal.1  Normal.2  Normal.3  Normal.5  Tumour.1  Tumour.2  Tumour.3  Tumour.4 
## 0.8836634 0.7294657 0.9307081 0.9443015 0.9205729 1.1301768 0.9482753 1.4667652 
##  Tumour.5 
## 1.1164742
counts.sf_normalized.new <- counts(DESeq.ds.new, normalized = T)
log.norm.counts.new <- log2(counts.sf_normalized.new + 1)
DESeq.rlog.new <- rlog(DESeq.ds.new, blind = TRUE)
rlog.norm.counts.new<- assay(DESeq.rlog.new)

Hierarchical clustering (exclude sample Normal.4)

# cor() calculates the correlation between columns of a matrix
distance.m_rlog.new <- as.dist(1 - cor(rlog.norm.counts.new, method = "pearson" ))
# plot() can directly interpret the output of hclust()
p_h_s <- ggdendrogram( hclust(distance.m_rlog.new)) +
  labs(title = "rlog transformed read counts \n distance(remove Normal.4): \nPearson correlation")
p_h_m <- ggpubr::ggarrange(plotlist = list(p_h_all, p_h_s)) 
p_h_m

Compared the two figures above, after excluding the normal.4, all the samples are clustered in the correct group.

PCA (exclude sample Normal.4)

P_p_s_data <- plotPCA(DESeq.rlog.new, returnData =T)
P_p_s <- ggplot(P_p_s_data, aes(x=PC1, y=PC2, fill = group, color = group)) + geom_point(size=3,shape=21) + theme_bw() + ggtitle("Rlog transformed counts (remove Normal.4)") + geom_text(aes(y = PC2+5, x = PC1 +5, label=name))

p_h_m <- ggpubr::ggarrange(plotlist = list(P_p_all, P_p_s), common.legend = T) 
p_h_m

Just like the results of Hierarchical clustering, After exclusion of Normal.4, The data looks better.

Differential Gene Expression Analysis (DGE)

DESeq2 is a common tool used for different gene expression analysis and it relies on a negative binomial model to fit the observed read counts to arrive at the estimate for the difference.

DESeq2 workflow

# DESeq2 uses the levels of the condition to determine the order of the comparison
str(colData(DESeq.ds.new)$condition)
##  Factor w/ 2 levels "normal","tumour": 1 1 1 1 2 2 2 2 2
# set normal as the first -level -factor
colData(DESeq.ds.new)$condition <- relevel(colData(DESeq.ds.new)$condition , "normal")
DESeq.ds.new <- DESeq(DESeq.ds.new) 
DGE.results.new <- results(DESeq.ds.new, independentFiltering = TRUE, alpha = 0.01)
summary(DGE.results.new)
## 
## out of 17550 with nonzero total read count
## adjusted p-value < 0.01
## LFC > 0 (up)       : 1274, 7.3%
## LFC < 0 (down)     : 1307, 7.4%
## outliers [1]       : 678, 3.9%
## low counts [2]     : 670, 3.8%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Exploratory plots following DGE analysis

Histograms

# Histogram of p-values for all genes tested for no differential expression between the two conditions, tumour and normal.
hist(DGE.results.new$pvalue ,
     col = "grey", border = "white", xlab = "", ylab = "", main = "frequencies of p-values")

MA plot

# The MA plot shows the relationship between the expression change (M) and average expression strength (A)
# genes that pass the significance threshold (adjusted p-value <0.01) are colored in red
plotMA(DGE.results.new, alpha = 0.01, main = "normal vs tumour", ylim = c(-14,14),   colSig = "red") 

Heatmaps

# Heatmaps of rlog-transformed read counts for genes with adjusted p-values <0.01 in the DGE analysis.
res <- DGE.results.new %>% data.frame()
res[which(res$log2FoldChange > 2 & res$padj < 0.001),'sig'] <- 'upregulated'
res[which(res$log2FoldChange < -2 & res$padj < 0.001),'sig'] <- 'downregulated'
res[which(abs(res$log2FoldChange) <= 2 | res$padj >= 0.001),'sig'] <- 'unregulated'
res <- na.omit(res)
DGE.results.sorted <- res[order(res$padj), ]
DGEgenes <- row.names(subset(DGE.results.sorted, sig != "unregulated"))
hm.mat_DGEgenes <- log.norm.counts.new[DGEgenes , ]
aheatmap(hm.mat_DGEgenes ,
         Rowv = TRUE , Colv = TRUE ,
         distfun = "euclidean", hclustfun = "average", color = "-RdYlBu2:100",
         scale = "row") 

nrow(subset(res, sig == "upregulated"))
## [1] 506
nrow(subset(res, sig == "downregulated"))
## [1] 734

Using 2.0 FC and 0.001 padj value cut off, 506 upregulated and 734 downregulated transcripts were identified.

Volcano plot

# Volcano plot of all genes. Top 10 upregulated and log2FoldChange more than 20 as well as top 10 downregulated and log2FoldChange less than -10 are labeled.
res$gene <- rownames(res)
 res_down_label <- subset(res, sig == "downregulated") %>% arrange(., padj) %>% head(10)
 res_down_label <- rbind(res_down_label, subset(res, sig== "downregulated" & log2FoldChange < -10))
res_up_label <- subset(res, sig == "upregulated") %>% arrange(., padj) %>% head(10)
res_up_label <- rbind(res_up_label, subset(res, sig== "upregulated" & log2FoldChange > 20))
res_label <- rbind(res_down_label, res_up_label)
p <- ggplot(data = res, aes(x = log2FoldChange, y = -1*log10(padj), color = sig)) + 
  geom_point(alpha =0.5, size = 1) + 
  scale_color_manual(values = c('red', 'gray', 'blue'), limits = c('upregulated', 'unregulated', 'downregulated')) + 
  labs(x = 'log2FC', y = '- log10padj', title = 'Volcano plot', color = 'Level') + 
  theme(plot.title = element_text(hjust = 0.5, size = 14)) + geom_vline(xintercept = c(-1, 1), lty = 3, color = 'black') + geom_hline(yintercept = -log10(0.05), lty = 3, color = 'black') +
  geom_text_repel(data=res_label, aes(label=gene),col="black",alpha =0.8, min.segment.length = 0.5)
p

Downstream analyses

To evaluate the potential functions of the DEGs, enrichment analysis with the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Gene Ontology (GO) terms were further performed.

Over-representation analyses (glsoseq and REVIGO)

GO term enrichment using goseq

gene.vector <- row.names(DGE.results.new) %in% DGEgenes %>% as.integer
names(gene.vector) <- row.names(DGE.results.new)
pwf <- nullp(gene.vector, "hg19", "geneSymbol")

GO.wall <- goseq(pwf, "hg19", "geneSymbol")
go_gns <- getgo( rownames(DGE.results.new), 'hg19', 'geneSymbol') %>% stack
merge(GO.wall, go_gns, by.x = "category", by.y = "values") %>% dim
## [1] 1814157       8
subset(GO.wall, over_represented_pvalue < 0.01, 
       select = c("category","over_represented_pvalue")) %>%
write.table(.,
              file = "output/tables/Enriched_GOterms_goseq.txt",
              quote = FALSE, row.names = FALSE, col.names = FALSE)

To summarize the results, REVIGO is used, REVIGO.

new_scripts_path  <- function(script_path){
  category <- strsplit(strsplit(script_path, "_")[[1]][length(strsplit(script_path, "_")[[1]])], "\\.")[[1]][1]
  sed_cmd <- "sed -n '/^revigo\\.data.*/,/^stuff.*/p'"
  fname <- script_path
  egrep_cmd <- "egrep '^rev|^c'"
  out_fname <- paste0("output/scripts/REVIGO_myData_", category, ".r", sep="")
  system(paste(sed_cmd, fname, "|", egrep_cmd, ">", out_fname))
  return(out_fname)
}

# modify the function downloaded from the website
REVIGO_treemap <- function(revigo.data, col_palette = "Paired",
                           title = "REVIGO Gene Ontology treemap", ...){
  stuff <- data.frame(revigo.data)
  names(stuff) <- c("term_ID","description","freqInDbPercent","abslog10pvalue",
                    "uniqueness","dispensability","representative")
  stuff$abslog10pvalue <- as.numeric( as.character(stuff$abslog10pvalue) )
  stuff$freqInDbPercent <- as.numeric( as.character(stuff$freqInDbPercent) )
  stuff$uniqueness <- as.numeric( as.character(stuff$uniqueness) )
  stuff$dispensability <- as.numeric( as.character(stuff$dispensability) )
  treemap::treemap(
    stuff,
    index = c("representative","description"),
    vSize = "abslog10pvalue",
    type = "categorical",
    vColor = "representative",
    title = title,
    inflate.labels = FALSE,      
    lowerbound.cex.labels = 0,   
    bg.labels = 255,
    position.legend = "none",
    fontsize.title = 22, fontsize.labels=c(18,12,8),
    palette= col_palette, ...
  )
  
  pdf( file="revigo_treemap.pdf", width=16, height=9 ) # width and height are in inches
treemap::treemap(
    stuff,
    index = c("representative","description"),
    vSize = "abslog10pvalue",
    type = "categorical",
    vColor = "representative",
    title =title,
    inflate.labels = FALSE,      # set this to TRUE for space-filling group labels - good for posters
    lowerbound.cex.labels = 0,   # try to draw as many labels as possible (still, some small squares may not get a label)
    bg.labels = "#CCCCCCAA",     # define background color of group labels
                                                       # "#CCCCCC00" is fully transparent, "#CCCCCCAA" is semi-transparent grey, NA is opaque
    position.legend = "none"
)

dev.off()
}

Biological Process

script_path = "output/scripts/REVIGO_treemap_bp.r"
new_script_file <- new_scripts_path(script_path)
source(new_script_file)
REVIGO_treemap(revigo.data)

## png 
##   2
system(paste0("mv revigo_treemap.pdf output/figures/", "revigo_treemap_", "bp.pdf", sep=""))

Molecular Function

script_path = "output/scripts/REVIGO_treemap_mf.r"
new_script_file <- new_scripts_path(script_path)
source(new_script_file)
REVIGO_treemap(revigo.data)

## png 
##   2
system(paste0("mv revigo_treemap.pdf output/figures/", "revigo_treemap_", "mf.pdf", sep=""))

Cellular Component

script_path = "output/scripts/REVIGO_treemap_cc.r"
new_script_file <- new_scripts_path(script_path)
source(new_script_file)
REVIGO_treemap(revigo.data)

## png 
##   2
system(paste0("mv revigo_treemap.pdf output/figures/", "revigo_treemap_", "cc.pdf", sep=""))

The differential expressed genes were identified significantly enriched in nuclear division, cyclic nucleotide mediated signaling and movement of cell or sub-cellular component GO terms. Most of these terms were related to cell replication and migration which are consistent with limitless replication potential and tissue invasion, the two main characteristics of cancer cells. The presence of genes indicative of Inflammatory response which is an essential component of tumor micro-environment was also observed. GO terms tube morphogenesis and regulation of hormone levels are related to breast cancer. Interestingly, response to fatty acid was also identified, suggesting that fatty acid metabolism might play a unique role in breast cancer patient compared from healthy people.

Gene set enrichment analyses

Gene set enrichment of KEGG pathways using ClusterProfiler

DGE.results.new <- DGE.results.new[order(-1*DGE.results.new$log2FoldChange),]
gene_symbol <- row.names(DGE.results.new)
gene_id <- bitr(gene_symbol, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")
DGE.results.new.subset <- subset(DGE.results.new, gene_symbol %in% gene_id$SYMBOL) 
gene_id_new <- subset(gene_id, gene_symbol %in% gene_id$SYMBOL)
genes_for_cp <- DGE.results.new.subset$log2FoldChange
names(genes_for_cp) <- gene_id_new$ENTREZID
gsea_kegg <- clusterProfiler::gseKEGG(geneList = genes_for_cp, organism = 'hsa',
                                      minGSSize = 10,
                                      pvalueCutoff = 1, verbose = FALSE)
## Dot plots
# Dot plots depict the enrichment scores and gene counts per gene set (for the most significant gene sets).
dotplot(gsea_kegg)

## Cnetplots depict the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network.
# - nodes = genes (of the top 5 most sign. GO terms by default)
# - edges = indicate whether a gene belongs to a given gene set
# - size of the GO terms = number of genes belong to a given category
cnetplot(gsea_kegg,
         colorEdge = TRUE, foldChange = genes_for_cp, node_label = "category", circle =T, showCategory = 7) +
  scale_colour_gradient2(name = "log2FC",
                         low = "navyblue", high = "red", mid = "white")

Breast cancer tissues exhibited marked enrichment in genes promoting cell proliferation, cancer, IL-17 signaling and Estrogen signaling pathways. The results are consistent with the Over-representation analyses.

genes involved in Pathways in cancer,IL-17 signaling pathway and Estrogen signaling pathway.

kegg_table <- gsea_kegg %>% data.frame()
selected_pathway <- kegg_table[c("hsa05206", "hsa04915", "hsa04657"),] ## Pathways in cancer, IL-17 signaling pathway, Estrogen signaling pathway
genes_kegg <- selected_pathway$core_enrichment
genes_pathways_in_cancer <- strsplit(genes_kegg[1], split = "/")[[1]]
genes_MicroRNAs_in_cancer <- strsplit(genes_kegg[2], split = "/")[[1]]
genes_Estrogen <- strsplit(genes_kegg[3], split = "/")[[1]]
genes_related_cancer <- c(genes_pathways_in_cancer, genes_MicroRNAs_in_cancer, genes_Estrogen)
genes_of_related_pathway <- subset(gene_id_new, ENTREZID %in% genes_related_cancer) %>%  merge(., res, by.x = "SYMBOL", by.y = "gene") %>% dplyr::select("SYMBOL", "log2FoldChange", "lfcSE", "padj", "sig") %>% subset(.,padj <0.01 & abs(log2FoldChange) >2) %>% arrange(desc(log2FoldChange))
write.table(genes_of_related_pathway, sep = "\t", file = "output/tables/genes_of_related_pathway.txt", row.names = F)
DT::datatable(genes_of_related_pathway)

Matrix metalloproteinases (MMPs), a family of zinc-dependent endopeptidases, are found inextracellular milieu of various tissues. They are involved in the degradation of extracellularmatrix (ECM). In this analysis, MMP1 and MMP13 are found ranked among the top 3 of the genes list related to cancer. A number of studies have investigated the association between MMPs expression and survival in breast cancer patients. Some researchers have identified MMP1 (PMID: 15864312, PMID: 23217186) and MMP13 (PMID: 19787229) as putative breast cancer predictive markers.

S100 Calcium Binding Protein A7(S100A7) was reported significantly up-regulated in breast cancer cells (PMID: 28629450) and have been identified as as an inflammation-associated protein relevant to breast tumor progression (PMID: 20101226).

There are few direct evidences showing Gamma-Aminobutyric Acid Type B Receptor Subunit 2(GABBR2) is related to breast cancer. However, several genome-wide screening (PMID: 31690011, PMID: 28591870) revealed that GABBR2 was up-regulated in Triple-Negative Breast Cancer.

Cell Division Cycle 25C(CDC25C) gene encodes a conserved protein that plays a key role in the regulation of cell division. Jeff C. Liu et al. (PMID: 29617654) reported that CDC25 as a common therapeutic target for Triple-Negative Breast Cancer. Cangi et al. (PMID: 10995786) also investigated CDC25 inhibitors as potential therapies for various cancers, including breast cancer.

The above is examples to analysis the top 5 genes. One the one hand, the results demonstrate the analysis methods are effective to identify genes and pathways related to breast cancer. On the other hand, for the genes which have not been identified playing unique roles in breast cancer, qPCR and other experiments should be conducted to verify the results.